Introduction

In this technical report, I present my step-by-step strategy for predicting whether a firm will experience fast growth.

I start by loading the necessary libraries and the data. The initial dataset has 287829 observations and 48 variables.
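The setup step can be sketched as follows; the package list is inferred from the functions used later in this report (Winsorize comes from DescTools, the rest from the tidyverse), and the file name is a placeholder, not the report's actual path:

```r
# Packages inferred from the calls used later in this report
library(tidyverse)   # dplyr, tidyr, ggplot2, forcats
library(DescTools)   # Winsorize()

# Load the raw data; the file name is a placeholder
data <- read_csv("firm_panel.csv")
```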

Let's start with a preliminary exploration. Variables with a very high share of NAs (>80%) can be excluded right away.

to_filter <- sapply(data, function(x) sum(is.na(x)))
to_filter[to_filter > 0]
##              COGS             amort       curr_assets         curr_liab 
##            269572              8040               131               131 
##         extra_exp         extra_inc extra_profit_loss     finished_prod 
##             18529             18529             17203            270344 
##      fixed_assets       inc_bef_tax     intang_assets       inventories 
##               131              7437               140               131 
##        liq_assets      material_exp     net_dom_sales     net_exp_sales 
##               131              8040            269572            269572 
##     personnel_exp  profit_loss_year             sales          share_eq 
##              8040              9884              7437               131 
##    subscribed_cap       tang_assets             wages                 D 
##               131              1470            269846            287829 
##      founded_year         exit_year         ceo_count           foreign 
##             56457            248970             56427             56427 
##            female        birth_year     inoffice_days         nace_main 
##             56427            111818             56427              1038 
##              ind2               ind      founded_date         exit_date 
##              1038              9769                51            231649 
##         labor_avg 
##            146532
## variables with more than 80% of NAs: COGS, finished_prod, net_dom_sales, net_exp_sales, wages
## they can be deleted

data <- data %>%
  select(-c(COGS, finished_prod, net_dom_sales, net_exp_sales, wages))
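Instead of listing the five columns by hand, the >80% rule can also be applied programmatically; a minimal sketch on a toy data frame (my own illustration, not the report's code):

```r
library(dplyr)

# Toy data frame: column b is 100% NA, a is exactly 80% NA, c has none
toy <- tibble(
  a = c(1, NA, NA, NA, NA),
  b = c(NA, NA, NA, NA, NA),
  c = 1:5
)

# Share of NAs per column, then drop columns strictly above the 80% threshold
na_share <- sapply(toy, function(x) mean(is.na(x)))
toy_clean <- toy %>% select(-all_of(names(na_share)[na_share > 0.8]))

names(toy_clean)  # a is kept because 0.8 is not strictly above the threshold
```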

Sample Design

The task requires working with data from 2010 to 2015. I also decided to focus on small and medium-sized enterprises (up to 50 million euros in sales, following the European Commission's definition), because very large firms may have a different growth potential and very large sales numbers (sales being the dependent variable of this analysis; more detail is given in the next section).

data_filtered <- data %>%
  filter(year %in% c(2010:2015))
NROW(unique(data_filtered$comp_id))
## [1] 39375
## 167606 observations now, 39375 unique companies

data_filtered <- data_filtered %>%
  filter(sales >= 1000 & sales <= 50000000)
NROW(unique(data_filtered$comp_id))
## [1] 33703
## 129491 observations now, 33703 unique companies

Label Engineering

The main dependent variable is fast growth of firms. It can be defined in different ways: as a change in income, a change in current assets, or a change in total sales. I prefer the total sales indicator, because it is an economic variable that makes a firm's development easy to interpret: if a firm is successful and growing, it sells more. Income does not always reflect a firm's growth: a firm can sell its assets or fire workers and thereby increase income or reduce expenditures, but this usually happens for reasons other than growth.

For my analysis, I use the percentage difference in total sales (a percentage takes different firm sizes and product types into account). The timeframe is two years: the percentage difference in sales between 2012 and 2014, which smooths possible one-year fluctuations and covers a wider time range. A disadvantage of sales as a dependent variable is that many exogenous confounders can affect it (for example, an economic or political crisis lowers incomes and purchasing power, which directly depresses sales); however, I assume that no such significant event happened between 2012 and 2014 in the United States (which, I believe, is close to reality).

data_filtered_wide <- data_filtered %>% 
  pivot_wider(id_cols = comp_id, names_from = year, values_from = sales) %>%
  mutate(total_change = `2014` - `2012`,
         perc_change = total_change/`2012`*100)

summary(data_filtered_wide$perc_change)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    -99.8    -17.5     12.8    196.8     58.1 509617.1    16626

There were some NAs (16626 firms, after transforming to wide format), so I imputed values where possible.

  1. In some cases, there is data for 2011 and 2013 but not for 2012, so I imputed the missing 2012 value by averaging sales in the surrounding years.
  2. In some cases, there is data for 2013 and 2014 but not for other years. Here I use 2013 as the reference year (2011 is too far from 2014); since my criterion for defining growth is quite tough, only a really large percentage change between these two consecutive years will be classified as growth.
  3. Following similar rules, 2014 was imputed by either averaging 2013 and 2015, or by taking 2015 as the final year instead of 2014.
  4. I delete rows with missings in years 2011-2013 or 2013-2015, because I cannot impute them or make any relevant predictions.
  5. All imputed missings are flagged.
data_filtered_wide <- data_filtered %>% 
  pivot_wider(id_cols = comp_id, names_from = year, values_from = sales) %>% 
  mutate(imputed = ifelse(is.na(`2012`) | is.na(`2014`), 1, 0)) %>%
  mutate(`2012` = ifelse(is.na(`2012`) & !is.na(`2011`) & !is.na(`2013`), (`2011` + `2013`)/2, `2012`),
         `2012` = ifelse(is.na(`2012`) & is.na(`2011`) & !is.na(`2013`), `2013`, `2012`),
         `2014` = ifelse(is.na(`2014`) & !is.na(`2013`) & !is.na(`2015`), (`2013` + `2015`)/2, `2014`),
         `2014` = ifelse(is.na(`2014`) & is.na(`2013`) & !is.na(`2015`), `2015`, `2014`)) %>%
  mutate(total_change = `2014` - `2012`,
         perc_change = total_change/`2012`*100)
summary(data_filtered_wide$perc_change)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   -100.0    -15.9     15.9    247.2     72.8 509617.1    14041

The number of firms with missing values was reduced to 14041. Since I do not think further imputation is possible, all remaining NAs are excluded from the analysis. The filtered dataset (back in long format) contains 102932 observations and 19662 unique firms.

data_filtered <- data_filtered_wide %>%
  select(comp_id, imputed, total_change, perc_change) %>%
  right_join(data_filtered, by = "comp_id")

data_filtered <- data_filtered[!is.na(data_filtered$perc_change),]
summary(data_filtered$perc_change)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   -100.0    -16.4     13.5    203.6     60.8 509617.1
NROW(unique(data_filtered$comp_id))
## [1] 19662
## 102932 observations now, 19662 firms
quantile(data_filtered$perc_change, probs = c(0.65))
##      65% 
## 35.02251
data_filtered %>%
  filter(perc_change <= 1000) %>%
  ggplot(aes(x = perc_change)) + geom_density() +
  geom_vline(xintercept = 35, color = "blue")

Finally, I create a fast-growth dummy, the main dependent variable of my analysis. Initially, I wanted to use the third quartile to define a fast-growing company, which corresponds to roughly 60% growth. However, I believe that a 40 or 50 percent increase in sales over two years can also be called fast. Thus, I use the 65th percentile instead of the third quartile (= 35% growth in sales). It splits the data fairly evenly (65% to 35%), and from an empirical perspective, I believe that 35% sales growth over two years can be classified as fast. 7432 firms are classified as fast-growing, 12230 as not fast-growing.

# creating fast growth binary 
data_filtered <- data_filtered %>%
  mutate(fast_growth = ifelse(perc_change >= 35, 1, 0))

data_filtered$fast_growth_f <- as.factor(data_filtered$fast_growth)

summary(data_filtered$fast_growth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3502  1.0000  1.0000
table(data_filtered$fast_growth_f)
## 
##     0     1 
## 66882 36050
data_filtered %>%
  select(comp_id, fast_growth) %>%
  unique() %>%
  group_by(fast_growth) %>%
  dplyr::summarize(n_firms = n())
## # A tibble: 2 × 2
##   fast_growth n_firms
##         <dbl>   <int>
## 1           0   12230
## 2           1    7432

Feature Engineering

First, I decided to group the independent variables into several categories.

  1. Firm's Performance
  2. Human Resources
  3. Geography
  4. Fixed Effects and other relevant characteristics

  1. Firm's Performance: "ln_amort", "amort_flag", "ln_curr_assets", "ln_fixed_assets", "ln_intang_assets", "ln_liq_assets", "ln_tang_assets", "flag_asset_problem", "ln_curr_liab", "ln_extra_exp", "ln_extra_inc", "inc_bef_tax_st", "ln_inventories", "ln_material_exp", "ln_personnel_exp", "profit_loss_year_st", "ln_subscribed_cap", "flag_error"

In many cases, I use natural logarithms instead of raw values, because this dampens the effect of large values and makes the distributions closer to Gaussian. Some financial indicators (for example, assets) cannot be negative, so I imputed such values and flagged the imputation. Furthermore, I standardised and winsorised two variables (income before taxes and profit_loss_year). All NAs (usually few per variable) were imputed.
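Each financial variable below receives the same three-step treatment: median-impute NAs, floor negative values at zero, take log(x + 1). The recipe can be condensed into a small helper; this is my own refactoring sketch, not code from the report:

```r
# Median-impute NAs, floor negatives at zero, then log-transform:
# the recipe repeated for each financial variable below
impute_floor_log <- function(x) {
  x <- ifelse(is.na(x), median(x, na.rm = TRUE), x)  # impute NAs with the median
  x <- ifelse(x < 0, 0, x)                           # negatives are treated as data errors
  log(x + 1)                                         # +1 keeps zeros finite
}

impute_floor_log(c(-5, 0, NA, 99))  # NA -> median 0; -5 floored to 0; then log(x + 1)
```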

## amort
data_filtered %>%
  ggplot(aes(x = amort)) + geom_density()

summary(data_filtered$amort) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -57800     259    1374   18662    5681 6597160     199
# 199 NAs - few enough to impute with the median; negative values are set to 0 and flagged

data_filtered <- data_filtered %>%
  mutate(amort =  ifelse(is.na(amort), median(amort, na.rm = T), amort),
         amort_flag = ifelse(amort < 0, 1, 0),
         amort = ifelse(amort < 0, 0, amort),
         ln_amort = log(amort + 1))
data_filtered %>%
  ggplot(aes(x = ln_amort)) + geom_density()

# flagging if assets < 0

data_filtered <- data_filtered %>%
  mutate(flag_asset_problem =  ifelse(intang_assets < 0 | curr_assets < 0 | fixed_assets < 0 | liq_assets < 0 | tang_assets < 0, 1, 0))

## curr_assets
data_filtered %>%
  ggplot(aes(x = curr_assets)) + geom_density()

summary(data_filtered$curr_assets) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    -34600      4711     15530    206644     56000 141101504        19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_asset_problem above)

data_filtered <- data_filtered %>%
  mutate(curr_assets =  ifelse(is.na(curr_assets), median(curr_assets, na.rm = T), curr_assets),
         curr_assets = ifelse(curr_assets < 0, 0, curr_assets),
         ln_curr_assets = log(curr_assets + 1))
data_filtered %>%
  ggplot(aes(x = ln_curr_assets)) + geom_density()

## fixed_assets
data_filtered %>%
  ggplot(aes(x = fixed_assets)) + geom_density()

summary(data_filtered$fixed_assets) 
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
##      -7341        663       8374     305925      61241 2696748032         19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_asset_problem above)

data_filtered <- data_filtered %>%
  mutate(fixed_assets =  ifelse(is.na(fixed_assets), median(fixed_assets, na.rm = T), fixed_assets),
         fixed_assets = ifelse(fixed_assets < 0, 0, fixed_assets),
         ln_fixed_assets = log(fixed_assets + 1))
data_filtered %>%
  ggplot(aes(x = ln_fixed_assets)) + geom_density()

## intang_assets
data_filtered %>%
  ggplot(aes(x = intang_assets)) + geom_density()

summary(data_filtered$intang_assets) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      -56        0        0     6335        0 10483637       19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_asset_problem above)

data_filtered <- data_filtered %>%
  mutate(intang_assets =  ifelse(is.na(intang_assets), median(intang_assets, na.rm = T), intang_assets),
         intang_assets = ifelse(intang_assets < 0, 0, intang_assets),
         ln_intang_assets = log(intang_assets + 1))
data_filtered %>%
  ggplot(aes(x = ln_intang_assets)) + geom_density()

## liq_assets
data_filtered %>%
  ggplot(aes(x = liq_assets)) + geom_density()

summary(data_filtered$liq_assets) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   -83589      719     3111    49712    13900 90536320       19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_asset_problem above)

data_filtered <- data_filtered %>%
  mutate(liq_assets =  ifelse(is.na(liq_assets), median(liq_assets, na.rm = T), liq_assets),
         liq_assets = ifelse(liq_assets < 0, 0, liq_assets),
         ln_liq_assets = log(liq_assets + 1))
data_filtered %>%
  ggplot(aes(x = ln_liq_assets)) + geom_density()

## tang_assets
data_filtered %>%
  ggplot(aes(x = tang_assets)) + geom_density()

summary(data_filtered$tang_assets) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    -10896       574      7593    244436     56593 120959248        19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_asset_problem above)

data_filtered <- data_filtered %>%
  mutate(tang_assets =  ifelse(is.na(tang_assets), median(tang_assets, na.rm = T), tang_assets),
         tang_assets = ifelse(tang_assets < 0, 0, tang_assets),
         ln_tang_assets = log(tang_assets + 1))
data_filtered %>%
  ggplot(aes(x = ln_tang_assets)) + geom_density()

# flagging if other non-negative variables < 0

data_filtered <- data_filtered %>%
  mutate(flag_error =  ifelse(curr_liab < 0 | extra_exp < 0 | extra_inc < 0 | inventories < 0 | material_exp < 0 | personnel_exp < 0 | subscribed_cap < 0, 1, 0))

# curr_liab
data_filtered %>%
  ggplot(aes(x = curr_liab)) + geom_density()

summary(data_filtered$curr_liab) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    -22778      3874     16367    160693     57311 137662528        19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_error below)

data_filtered <- data_filtered %>%
  mutate(curr_liab =  ifelse(is.na(curr_liab), median(curr_liab, na.rm = T), curr_liab),
         curr_liab = ifelse(curr_liab < 0, 0, curr_liab),
         ln_curr_liab = log(curr_liab + 1))
data_filtered %>%
  ggplot(aes(x = ln_curr_liab)) + geom_density()

## extra_exp
data_filtered %>%
  ggplot(aes(x = extra_exp)) + geom_density()

summary(data_filtered$extra_exp) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  -133585        0        0     1599        0 13918115        3
# 3 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_error above)

data_filtered <- data_filtered %>%
  mutate(extra_exp =  ifelse(is.na(extra_exp), median(extra_exp, na.rm = T), extra_exp),
         extra_exp = ifelse(extra_exp < 0, 0, extra_exp),
         ln_extra_exp = log(extra_exp + 1))
data_filtered %>%
  ggplot(aes(x = ln_extra_exp)) + geom_density()

## extra_inc
data_filtered %>%
  ggplot(aes(x = extra_inc)) + geom_density()

summary(data_filtered$extra_inc) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  -274407        0        0     3295        0 15686481        3
# 3 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_error above)

data_filtered <- data_filtered %>%
  mutate(extra_inc =  ifelse(is.na(extra_inc), median(extra_inc, na.rm = T), extra_inc),
         extra_inc = ifelse(extra_inc < 0, 0, extra_inc),
         ln_extra_inc = log(extra_inc + 1))
data_filtered %>%
  ggplot(aes(x = ln_extra_inc)) + geom_density()

## inc_bef_tax
data_filtered %>%
  ggplot(aes(x = inc_bef_tax)) + geom_density()

summary(data_filtered$inc_bef_tax) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -17102708     -3570       889     21334      7459  13840463
# in this case, I decided to standardise and then winsorise it
data_filtered <- data_filtered %>%
  mutate(inc_bef_tax_st = (inc_bef_tax - mean(inc_bef_tax))/sd(inc_bef_tax))

data_filtered$inc_bef_tax_st <- Winsorize(data_filtered$inc_bef_tax_st, minval = NULL, maxval = NULL, probs = c(0.05, 0.95),
                  na.rm = FALSE, type = 7)
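Winsorize() is DescTools' clipping function; the same 5th/95th-percentile clipping can be written in base R, which makes the operation explicit (a sketch using the same probs and quantile type as above):

```r
# Clip a vector at its 5th and 95th percentiles (what Winsorize() does above)
winsorize_05_95 <- function(x) {
  bounds <- unname(quantile(x, probs = c(0.05, 0.95), na.rm = TRUE, type = 7))
  pmin(pmax(x, bounds[1]), bounds[2])  # values outside the bounds are clamped
}

x <- c(-100, 1:99)         # one extreme low value among 1..99
range(winsorize_05_95(x))  # extremes pulled in to the percentile bounds
```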

data_filtered %>%
  ggplot(aes(x = inc_bef_tax_st)) + geom_density()

## inventories
data_filtered %>%
  ggplot(aes(x = inventories)) + geom_density()

summary(data_filtered$inventories) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    -2607        0     1589    59118     9252 23653078       19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_error above)

data_filtered <- data_filtered %>%
  mutate(inventories =  ifelse(is.na(inventories), median(inventories, na.rm = T), inventories),
         inventories = ifelse(inventories < 0, 0, inventories),
         ln_inventories = log(inventories + 1))
data_filtered %>%
  ggplot(aes(x = ln_inventories)) + geom_density()

## material_exp
data_filtered %>%
  ggplot(aes(x = material_exp)) + geom_density()

summary(data_filtered$material_exp) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -1097800    12922    40744   345940   127715 44773684      199
# 199 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_error above)

data_filtered <- data_filtered %>%
  mutate(material_exp =  ifelse(is.na(material_exp), median(material_exp, na.rm = T), material_exp),
         material_exp = ifelse(material_exp < 0, 0, material_exp),
         ln_material_exp = log(material_exp + 1))
data_filtered %>%
  ggplot(aes(x = ln_material_exp)) + geom_density()

## personnel_exp
data_filtered %>%
  ggplot(aes(x = personnel_exp)) + geom_density()

summary(data_filtered$personnel_exp) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  -163367     4874    13533    99262    38600 23893608      199
# 199 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_error above)

data_filtered <- data_filtered %>%
  mutate(personnel_exp =  ifelse(is.na(personnel_exp), median(personnel_exp, na.rm = T), personnel_exp),
         personnel_exp = ifelse(personnel_exp < 0, 0, personnel_exp),
         ln_personnel_exp = log(personnel_exp + 1))
data_filtered %>%
  ggplot(aes(x = ln_personnel_exp)) + geom_density()

## profit_loss_year
data_filtered %>%
  ggplot(aes(x = profit_loss_year)) + geom_density()

summary(data_filtered$profit_loss_year) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -17102708     -3663       444      8981      4504  13827796        25
# 25 NAs - few enough to impute with the median; then standardise and winsorise

data_filtered <- data_filtered %>%
  mutate(profit_loss_year =  ifelse(is.na(profit_loss_year), median(profit_loss_year, na.rm = T), profit_loss_year),
         profit_loss_year_st = (profit_loss_year - mean(profit_loss_year))/sd(profit_loss_year))

data_filtered$profit_loss_year_st <- Winsorize(data_filtered$profit_loss_year_st, minval = NULL, maxval = NULL, probs = c(0.05, 0.95),
                                          na.rm = FALSE, type = 7)

data_filtered %>%
  ggplot(aes(x = profit_loss_year_st)) + geom_density()

## subscribed_cap
data_filtered %>%
  ggplot(aes(x = subscribed_cap)) + geom_density()

summary(data_filtered$subscribed_cap) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     -185     1852     6467    66764    11111 46222532       19
# 19 NAs - few enough to impute with the median; negative values are set to 0 (flagged via flag_error above)

data_filtered <- data_filtered %>%
  mutate(subscribed_cap =  ifelse(is.na(subscribed_cap), median(subscribed_cap, na.rm = T), subscribed_cap),
         subscribed_cap = ifelse(subscribed_cap < 0, 0, subscribed_cap),
         ln_subscribed_cap = log(subscribed_cap + 1))
data_filtered %>%
  ggplot(aes(x = ln_subscribed_cap)) + geom_density()

## set of numeric variables
financial_perform <- c("ln_amort", "amort_flag", "ln_curr_assets", "ln_fixed_assets", "ln_intang_assets", "ln_liq_assets", "ln_tang_assets", "flag_asset_problem", "ln_curr_liab", "ln_extra_exp", "ln_extra_inc", "inc_bef_tax_st", "ln_inventories", "ln_material_exp", "ln_personnel_exp", "profit_loss_year_st", "ln_subscribed_cap", "flag_error")

  2. Human Resources: "ceo_count", "ceo_count_flag", "foreign", "female", "inoffice_days", "gender", "origin"

## set of variables related to the people

### ceo_count
data_filtered %>%
  ggplot(aes(x = ceo_count)) + geom_bar()

summary(data_filtered$ceo_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.294   2.000  15.000    7944
# let's merge all categories with 3 or more CEOs; NAs are set to the median (1 CEO) and flagged
data_filtered <- data_filtered %>%
  mutate(ceo_count_flag = ifelse(is.na(ceo_count), 1, 0),
         ceo_count = ifelse(is.na(ceo_count), median(ceo_count, na.rm = T), ceo_count),
         ceo_count = as.factor(ifelse(ceo_count >= 3, 3, ceo_count)))

data_filtered %>%
  ggplot(aes(x = ceo_count)) + geom_bar()

### foreign CEO share
data_filtered %>%
  ggplot(aes(x = foreign)) + geom_density()

summary(data_filtered$foreign)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.103   0.000   1.000    7944
# let's replace NAs with the mean (since it is a share) and flag them
data_filtered <- data_filtered %>%
  mutate(foreign_flag = ifelse(is.na(foreign), 1, 0),
         foreign = ifelse(is.na(foreign), mean(foreign, na.rm = T), foreign))


data_filtered %>%
  ggplot(aes(x = foreign)) + geom_bar()

### female CEO share
data_filtered %>%
  ggplot(aes(x = female)) + geom_density()

summary(data_filtered$female)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.251   0.500   1.000    7944
# similarly, let's replace NAs with the mean (since it is a share) and flag them
data_filtered <- data_filtered %>%
  mutate(female_flag = ifelse(is.na(female), 1, 0),
         female = ifelse(is.na(female), mean(female, na.rm = T), female))

data_filtered %>%
  ggplot(aes(x = female)) + geom_bar()

### birth_year
# let's replace NAs with the mean (floored to a whole year) and flag them
data_filtered <- data_filtered %>%
  mutate(birth_year_flag = ifelse(is.na(birth_year), 1, 0),
         birth_year = ifelse(is.na(birth_year), mean(birth_year, na.rm = T), birth_year),
         birth_year = floor(birth_year),
         birth_year = as.factor(birth_year))

data_filtered %>%
  ggplot(aes(x = birth_year)) + geom_bar()

### inoffice_days
data_filtered %>%
  ggplot(aes(x = inoffice_days)) + geom_bar()

# table(data_filtered$inoffice_days)
summary(data_filtered$inoffice_days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      10    1858    2718    3108    3706   10320    7944
# in the same way, let's replace NAs with the mean and flag them
data_filtered <- data_filtered %>%
  mutate(inoffice_days_flag = ifelse(is.na(inoffice_days), 1, 0),
         inoffice_days = ifelse(is.na(inoffice_days), mean(inoffice_days, na.rm = T), inoffice_days))

data_filtered %>%
  ggplot(aes(x = inoffice_days)) + geom_bar()

### gender
data_filtered %>%
  ggplot(aes(x = gender)) + geom_bar()

table(data_filtered$gender)
## 
##        female   male    mix 
##   7944  18163  65341  11484
summary(data_filtered$gender)
##    Length     Class      Mode 
##    102932 character character
# here, missing gender values (empty strings) can serve as the reference category
data_filtered$gender <- fct_recode(data_filtered$gender, "zero" = "")
data_filtered <- data_filtered %>%
  mutate(gender = as.factor(gender),
         gender = relevel(gender, ref = "zero")) 

### origin
data_filtered %>%
  ggplot(aes(x = origin)) + geom_bar()

table(data_filtered$origin)
## 
##          Domestic  Foreign      mix 
##     7944    83730     8414     2844
summary(data_filtered$origin)
##    Length     Class      Mode 
##    102932 character character
# missing origin values can also serve as the reference category
data_filtered$origin <- fct_recode(data_filtered$origin, "zero" = "")
data_filtered <- data_filtered %>%
  mutate(origin = as.factor(origin),
         origin = relevel(origin, ref = "zero"))

hr <- c("ceo_count", "ceo_count_flag", "foreign", "female",
        "inoffice_days", "gender", "origin")

  3. Geography of the firm's HQ: urban_m, region_m

### urban_m
data_filtered %>%
  ggplot(aes(x = urban_m)) + geom_bar()

data_filtered <- data_filtered %>%
  mutate(urban_m = as.factor(urban_m),
         urban_m = fct_recode(urban_m, "capital" = "1", "other big" = "2", "other" = "3"))

data_filtered %>%
  ggplot(aes(x = urban_m)) + geom_bar()

### region_m
data_filtered %>%
  ggplot(aes(x = region_m)) + geom_bar()

table(data_filtered$region_m) # 321 observations have no region; they are dropped below
## 
##         Central    East    West 
##     321   61143   25426   16042
data_filtered <- data_filtered %>%
  mutate(region_m = factor(region_m, levels = c("Central", "East", "West"))) %>%
  drop_na(region_m) # delete NAs

data_filtered %>%
  ggplot(aes(x = region_m)) + geom_bar()

geography <- c("urban_m", "region_m")

  4. Fixed Effects and other relevant characteristics: year, the firm's industry code (modified ind2), and the imputed-DV flag

data_filtered %>%
  ggplot(aes(x = year)) + geom_bar()

data_filtered$year <- as.factor(data_filtered$year)
summary(data_filtered$year)
##  2010  2011  2012  2013  2014  2015 
## 13753 15438 17309 19247 19241 17623
data_filtered %>%
  ggplot(aes(x = ind2)) + geom_bar()
## Warning: Removed 34 rows containing non-finite values (stat_count).

## too many categories, some too specific and small - they need to be merged into broader groups; the 34 NAs get their own category
data_filtered <- data_filtered %>%
  mutate(ind2_cat = ind2 %>%
           ifelse(. > 56, 60, .)  %>%
           ifelse(. < 26, 20, .) %>%
           ifelse(. < 55 & . > 35, 40, .) %>%
           ifelse(. == 31, 30, .) %>%
           ifelse(is.na(.), 99, .)
           )
data_filtered$ind2_cat <- as.factor(data_filtered$ind2_cat)
table(data_filtered$ind2_cat)
## 
##    20    26    27    28    29    30    32    33    40    55    56    60    99 
##   258  5860  3580 10592  1687   909   746 10076  1046 11482 55199  1142    34
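The ifelse chain above can be written more readably with dplyr::case_when; a sketch of the same recoding (equivalent here because the recoded ranges do not overlap):

```r
library(dplyr)

# Same ind2 recoding as the ifelse chain, expressed with case_when
recode_ind2 <- function(ind2) {
  case_when(
    is.na(ind2)           ~ 99,  # missing codes get their own bucket
    ind2 > 56             ~ 60,
    ind2 < 26             ~ 20,
    ind2 > 35 & ind2 < 55 ~ 40,
    ind2 == 31            ~ 30,
    TRUE                  ~ ind2
  )
}

recode_ind2(c(10, 31, 40, 55, 99, NA))
```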
data_filtered %>%
  ggplot(aes(x = imputed)) + geom_bar()

data_filtered$imputed <- as.factor(data_filtered$imputed)
summary(data_filtered$imputed)
##     0     1 
## 94579  8032
## set of basic variables
fe <- c("year", "ind2_cat", "imputed")

Finally, I add some polynomials and interactions.

# sq is needed
ggplot(data = data_filtered, aes(x = ln_amort, y = as.numeric(fast_growth))) +
  geom_point(size = 2,  shape = 20, stroke=2, fill = "blue", color = "blue") +
  geom_smooth(method = "lm", formula = y ~ poly(x,2), color = color[4], se = F, size = 1)+
  geom_smooth(method = "loess", se = F, colour = color[5], size = 1.5, span = 0.9) +
  labs(x = "ln_amort", y = "fast_growth") +
  theme_bg()
## `geom_smooth()` using formula 'y ~ x'

# sq is needed
ggplot(data = data_filtered, aes(x = ln_curr_assets, y = as.numeric(fast_growth))) +
  geom_point(size = 2,  shape = 20, stroke=2, fill = "blue", color = "blue") +
  geom_smooth(method = "lm", formula = y ~ poly(x,2), color = color[4], se = F, size = 1)+
  geom_smooth(method = "loess", se = F, colour = color[5], size = 1.5, span = 0.9) +
  labs(x = "ln_curr_assets", y = "fast_growth") +
  theme_bg()
## `geom_smooth()` using formula 'y ~ x'

# it is better to include all squares, LASSO will decide

data_filtered <- data_filtered %>%
  mutate(ln_amort2 = ln_amort^2,
         ln_curr_assets2 = ln_curr_assets^2,
         ln_curr_liab2 = ln_curr_liab^2,
         ln_extra_exp2 = ln_extra_exp^2,
         ln_extra_inc2 = ln_extra_inc^2,
         ln_fixed_assets2 = ln_fixed_assets^2,
         inc_bef_tax_st2 = inc_bef_tax_st^2,
         ln_intang_assets2 = ln_intang_assets^2,
         ln_inventories2 = ln_inventories^2,
         ln_liq_assets2 = ln_liq_assets^2,
         ln_material_exp2 = ln_material_exp^2,
         ln_personnel_exp2 = ln_personnel_exp^2,
         profit_loss_year_st2 = profit_loss_year_st^2,
         ln_tang_assets2 = ln_tang_assets^2,
         ln_subscribed_cap2 = ln_subscribed_cap^2)

polynomials <- c("ln_amort2", "ln_curr_assets2", "ln_curr_liab2", "ln_extra_exp2",
                 "ln_extra_inc2", "ln_fixed_assets2", 
                 "inc_bef_tax_st2", "ln_intang_assets2", "ln_inventories2",
                 "ln_liq_assets2", "ln_material_exp2", "ln_personnel_exp2",
                 "profit_loss_year_st2", "ln_tang_assets2", "ln_subscribed_cap2")

interactions1 <- c("urban_m*ln_extra_exp", "urban_m*ln_personnel_exp", "urban_m*inc_bef_tax_st", "urban_m*ln_fixed_assets2",
                  "origin*ln_extra_exp", "origin*ln_personnel_exp", "origin*inc_bef_tax_st", "origin*ln_fixed_assets2",
                  "region_m*ln_extra_exp", "region_m*ln_personnel_exp", "region_m*inc_bef_tax_st", "region_m*ln_fixed_assets2",
                  "gender*ln_extra_exp", "gender*ln_personnel_exp", "gender*inc_bef_tax_st", "gender*ln_fixed_assets2")

interactions2 <- c("foreign*ln_extra_exp", "foreign*ln_personnel_exp", "foreign*inc_bef_tax_st", "foreign*ln_fixed_assets2",
                   "female*ln_extra_exp", "female*ln_personnel_exp", "female*inc_bef_tax_st", "female*ln_fixed_assets2")

Model Building

I build six basic models of increasing complexity.

X1 <- c(financial_perform)
X2 <- c(financial_perform, hr)
X3 <- c(financial_perform, hr, geography)
X4 <- c(financial_perform, hr, geography, fe)
X5 <- c(financial_perform, hr, geography, fe, polynomials)
X6 <- c(financial_perform, hr, geography, fe, polynomials, interactions1, interactions2)

# for LASSO
logitvars <- c(financial_perform, hr, geography, fe, polynomials, interactions1, interactions2)
  
# for random forest
rfvars  <-  c(financial_perform, hr, geography, fe)

Now, I compare the results of OLS and logit.

# ols - model 1
ols_modelx1 <- lm(formula(paste0("fast_growth ~ ", paste0(X1, collapse = " + "))),
                data = data_filtered)
summary(ols_modelx1)
## 
## Call:
## lm(formula = formula(paste0("fast_growth ~ ", paste0(X1, collapse = " + "))), 
##     data = data_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5630 -0.3551 -0.3196  0.6251  0.8808 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.3165551  0.0122624  25.815  < 2e-16 ***
## ln_amort             0.0005861  0.0007831   0.748 0.454189    
## amort_flag          -0.1233684  0.0907240  -1.360 0.173889    
## ln_curr_assets      -0.0055367  0.0014194  -3.901 9.59e-05 ***
## ln_fixed_assets     -0.0027046  0.0019025  -1.422 0.155136    
## ln_intang_assets    -0.0012043  0.0006190  -1.946 0.051699 .  
## ln_liq_assets        0.0033591  0.0009885   3.398 0.000679 ***
## ln_tang_assets      -0.0053018  0.0018681  -2.838 0.004540 ** 
## flag_asset_problem  -0.0111655  0.0483724  -0.231 0.817453    
## ln_curr_liab         0.0086665  0.0008319  10.417  < 2e-16 ***
## ln_extra_exp        -0.0051823  0.0008287  -6.254 4.02e-10 ***
## ln_extra_inc        -0.0026911  0.0006453  -4.170 3.04e-05 ***
## inc_bef_tax_st      -0.0753540  0.0303073  -2.486 0.012908 *  
## ln_inventories      -0.0030928  0.0004365  -7.086 1.39e-12 ***
## ln_material_exp      0.0133353  0.0013203  10.100  < 2e-16 ***
## ln_personnel_exp    -0.0065671  0.0006457 -10.170  < 2e-16 ***
## profit_loss_year_st  0.2070615  0.0359633   5.758 8.56e-09 ***
## ln_subscribed_cap   -0.0019209  0.0008703  -2.207 0.027308 *  
## flag_error           0.0331386  0.0462199   0.717 0.473390    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4753 on 102380 degrees of freedom
##   (212 observations deleted due to missingness)
## Multiple R-squared:  0.008024,   Adjusted R-squared:  0.00785 
## F-statistic: 46.01 on 18 and 102380 DF,  p-value: < 2.2e-16
# logit - model 1
glm_modelx1 <- glm(formula(paste0("fast_growth_f ~", paste0(X1, collapse = " + "))),
                   data = data_filtered, family = "binomial")
summary(glm_modelx1)
## 
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X1, collapse = " + "))), 
##     family = "binomial", data = data_filtered)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3071  -0.9355  -0.8771   1.4018   1.9069  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.768939   0.054498 -14.109  < 2e-16 ***
## ln_amort             0.002630   0.003442   0.764 0.444839    
## amort_flag          -0.671975   0.481310  -1.396 0.162673    
## ln_curr_assets      -0.025108   0.006269  -4.005 6.20e-05 ***
## ln_fixed_assets     -0.011226   0.008296  -1.353 0.175975    
## ln_intang_assets    -0.005666   0.002775  -2.042 0.041151 *  
## ln_liq_assets        0.014954   0.004395   3.402 0.000668 ***
## ln_tang_assets      -0.023460   0.008151  -2.878 0.003999 ** 
## flag_asset_problem  -0.050676   0.213767  -0.237 0.812610    
## ln_curr_liab         0.038410   0.003740  10.271  < 2e-16 ***
## ln_extra_exp        -0.024045   0.003790  -6.345 2.23e-10 ***
## ln_extra_inc        -0.012561   0.002928  -4.290 1.79e-05 ***
## inc_bef_tax_st      -0.339916   0.136506  -2.490 0.012770 *  
## ln_inventories      -0.013441   0.001920  -7.000 2.57e-12 ***
## ln_material_exp      0.058277   0.005879   9.913  < 2e-16 ***
## ln_personnel_exp    -0.028260   0.002808 -10.063  < 2e-16 ***
## profit_loss_year_st  0.933886   0.161737   5.774 7.74e-09 ***
## ln_subscribed_cap   -0.008955   0.003856  -2.322 0.020209 *  
## flag_error           0.120422   0.212418   0.567 0.570775    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132683  on 102398  degrees of freedom
## Residual deviance: 131861  on 102380  degrees of freedom
##   (212 observations deleted due to missingness)
## AIC: 131899
## 
## Number of Fisher Scoring iterations: 4
# Check model X2
glm_modelx2 <- glm(formula(paste0("fast_growth_f ~", paste0(X2, collapse = " + "))),
                 data = data_filtered, family = "binomial")
summary(glm_modelx2)
## 
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X2, collapse = " + "))), 
##     family = "binomial", data = data_filtered)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4298  -0.9653  -0.8057   1.3228   2.2033  
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.222e-01  3.018e-01   1.068   0.2857    
## ln_amort             4.151e-03  3.474e-03   1.195   0.2321    
## amort_flag          -8.277e-01  4.847e-01  -1.708   0.0877 .  
## ln_curr_assets      -1.320e-02  6.364e-03  -2.074   0.0381 *  
## ln_fixed_assets     -3.326e-03  8.403e-03  -0.396   0.6923    
## ln_intang_assets    -6.528e-03  2.814e-03  -2.320   0.0203 *  
## ln_liq_assets        2.083e-02  4.443e-03   4.689 2.74e-06 ***
## ln_tang_assets      -1.739e-02  8.256e-03  -2.106   0.0352 *  
## flag_asset_problem   2.033e-02  2.164e-01   0.094   0.9251    
## ln_curr_liab         2.765e-02  3.782e-03   7.310 2.67e-13 ***
## ln_extra_exp        -2.137e-02  3.825e-03  -5.588 2.30e-08 ***
## ln_extra_inc        -1.365e-02  2.962e-03  -4.608 4.07e-06 ***
## inc_bef_tax_st      -1.736e-01  1.383e-01  -1.255   0.2094    
## ln_inventories      -9.913e-03  1.945e-03  -5.098 3.44e-07 ***
## ln_material_exp      3.543e-02  5.938e-03   5.966 2.43e-09 ***
## ln_personnel_exp    -2.426e-02  2.846e-03  -8.524  < 2e-16 ***
## profit_loss_year_st  8.789e-01  1.635e-01   5.375 7.67e-08 ***
## ln_subscribed_cap   -1.880e-02  3.981e-03  -4.722 2.33e-06 ***
## flag_error           6.888e-02  2.142e-01   0.322   0.7478    
## ceo_count2           2.892e-03  2.244e-02   0.129   0.8974    
## ceo_count3          -2.101e-01  4.825e-02  -4.355 1.33e-05 ***
## ceo_count_flag      -7.256e-02  2.096e-01  -0.346   0.7292    
## foreign             -8.625e-01  4.875e-01  -1.769   0.0768 .  
## female              -2.985e-01  3.929e-01  -0.760   0.4474    
## inoffice_days       -2.043e-04  4.553e-06 -44.884  < 2e-16 ***
## genderfemale         2.408e-01  2.028e-01   1.187   0.2351    
## gendermale           4.411e-02  1.948e-01   0.226   0.8208    
## gendermix                   NA         NA      NA       NA    
## originDomestic      -3.868e-01  2.422e-01  -1.597   0.1102    
## originForeign        4.350e-01  2.540e-01   1.712   0.0868 .  
## originmix                   NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132683  on 102398  degrees of freedom
## Residual deviance: 129517  on 102370  degrees of freedom
##   (212 observations deleted due to missingness)
## AIC: 129575
## 
## Number of Fisher Scoring iterations: 4
# calculate average marginal effects (dy/dx) for logit
mx2 <- margins(glm_modelx2, vce = "none")

sum_table <- summary(glm_modelx2) %>%
  coef() %>%
  as.data.frame() %>%
  dplyr::select(Estimate) %>%
  mutate(factor = row.names(.)) %>%
  merge(summary(mx2)[,c("factor","AME")])
kable(x = sum_table, format = "html", digits = 3,
      col.names = c("Variable", "Coefficient", "dy/dx"),
      caption = "Average Marginal Effects (dy/dx) for Logit Model")
Average Marginal Effects (dy/dx) for Logit Model
Variable Coefficient dy/dx
amort_flag -0.828 -0.183
ceo_count_flag -0.073 -0.016
ceo_count2 0.003 0.001
ceo_count3 -0.210 -0.045
female -0.299 -0.066
flag_asset_problem 0.020 0.004
flag_error 0.069 0.015
foreign -0.863 -0.191
genderfemale 0.241 0.054
gendermale 0.044 0.010
inc_bef_tax_st -0.174 -0.038
inoffice_days 0.000 0.000
ln_amort 0.004 0.001
ln_curr_assets -0.013 -0.003
ln_curr_liab 0.028 0.006
ln_extra_exp -0.021 -0.005
ln_extra_inc -0.014 -0.003
ln_fixed_assets -0.003 -0.001
ln_intang_assets -0.007 -0.001
ln_inventories -0.010 -0.002
ln_liq_assets 0.021 0.005
ln_material_exp 0.035 0.008
ln_personnel_exp -0.024 -0.005
ln_subscribed_cap -0.019 -0.004
ln_tang_assets -0.017 -0.004
originDomestic -0.387 -0.087
originForeign 0.435 0.103
profit_loss_year_st 0.879 0.194
ols_modelx3 <- lm(formula(paste0("fast_growth ~", paste0(X3, collapse = " + "))),
                data = data_filtered)
summary(ols_modelx3)
## 
## Call:
## lm(formula = formula(paste0("fast_growth ~", paste0(X3, collapse = " + "))), 
##     data = data_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6189 -0.3750 -0.2826  0.5845  1.0029 
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.614e-01  6.365e-02   8.821  < 2e-16 ***
## ln_amort             8.787e-04  7.746e-04   1.134  0.25667    
## amort_flag          -1.605e-01  8.975e-02  -1.788  0.07374 .  
## ln_curr_assets      -3.179e-03  1.411e-03  -2.253  0.02429 *  
## ln_fixed_assets     -8.577e-04  1.882e-03  -0.456  0.64863    
## ln_intang_assets    -1.437e-03  6.142e-04  -2.341  0.01926 *  
## ln_liq_assets        4.592e-03  9.786e-04   4.693 2.70e-06 ***
## ln_tang_assets      -3.729e-03  1.849e-03  -2.017  0.04371 *  
## flag_asset_problem   4.918e-03  4.784e-02   0.103  0.91812    
## ln_curr_liab         5.977e-03  8.261e-04   7.235 4.69e-13 ***
## ln_extra_exp        -4.438e-03  8.200e-04  -5.412 6.25e-08 ***
## ln_extra_inc        -2.733e-03  6.394e-04  -4.274 1.92e-05 ***
## inc_bef_tax_st      -4.196e-02  3.001e-02  -1.398  0.16205    
## ln_inventories      -2.128e-03  4.334e-04  -4.911 9.09e-07 ***
## ln_material_exp      7.784e-03  1.316e-03   5.917 3.30e-09 ***
## ln_personnel_exp    -5.391e-03  6.408e-04  -8.413  < 2e-16 ***
## profit_loss_year_st  2.006e-01  3.562e-02   5.631 1.80e-08 ***
## ln_subscribed_cap   -4.233e-03  8.739e-04  -4.844 1.28e-06 ***
## flag_error           1.855e-02  4.571e-02   0.406  0.68481    
## ceo_count2          -1.379e-03  4.963e-03  -0.278  0.78118    
## ceo_count3          -4.676e-02  1.028e-02  -4.549 5.38e-06 ***
## ceo_count_flag      -1.476e-02  4.428e-02  -0.333  0.73888    
## foreign             -1.684e-01  1.022e-01  -1.649  0.09924 .  
## female              -6.698e-02  8.144e-02  -0.822  0.41084    
## inoffice_days       -4.199e-05  9.178e-07 -45.752  < 2e-16 ***
## genderfemale         5.261e-02  4.212e-02   1.249  0.21159    
## gendermale           7.562e-03  4.038e-02   0.187  0.85146    
## gendermix                   NA         NA      NA       NA    
## originDomestic      -7.513e-02  5.094e-02  -1.475  0.14029    
## originForeign        8.542e-02  5.324e-02   1.605  0.10861    
## originmix                   NA         NA      NA       NA    
## urban_mother big    -1.449e-02  4.434e-03  -3.268  0.00109 ** 
## urban_mother        -3.872e-03  4.102e-03  -0.944  0.34521    
## region_mEast        -8.664e-03  4.082e-03  -2.123  0.03380 *  
## region_mWest        -1.955e-02  4.669e-03  -4.188 2.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.47 on 102366 degrees of freedom
##   (212 observations deleted due to missingness)
## Multiple R-squared:   0.03,  Adjusted R-squared:  0.02969 
## F-statistic: 98.92 on 32 and 102366 DF,  p-value: < 2.2e-16
glm_modelx3 <- glm(formula(paste0("fast_growth_f ~", paste0(X3, collapse = " + "))),
                 data = data_filtered, family = "binomial")
summary(glm_modelx3)
## 
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X3, collapse = " + "))), 
##     family = "binomial", data = data_filtered)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4344  -0.9653  -0.8037   1.3214   2.2015  
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.882e-01  3.021e-01   1.285   0.1988    
## ln_amort             4.356e-03  3.475e-03   1.254   0.2100    
## amort_flag          -8.593e-01  4.850e-01  -1.772   0.0764 .  
## ln_curr_assets      -1.423e-02  6.367e-03  -2.235   0.0254 *  
## ln_fixed_assets     -3.395e-03  8.408e-03  -0.404   0.6864    
## ln_intang_assets    -6.470e-03  2.815e-03  -2.298   0.0215 *  
## ln_liq_assets        2.087e-02  4.445e-03   4.696 2.65e-06 ***
## ln_tang_assets      -1.640e-02  8.263e-03  -1.984   0.0472 *  
## flag_asset_problem   2.017e-02  2.163e-01   0.093   0.9257    
## ln_curr_liab         2.693e-02  3.783e-03   7.119 1.09e-12 ***
## ln_extra_exp        -2.088e-02  3.826e-03  -5.457 4.85e-08 ***
## ln_extra_inc        -1.297e-02  2.965e-03  -4.373 1.23e-05 ***
## inc_bef_tax_st      -1.814e-01  1.383e-01  -1.312   0.1895    
## ln_inventories      -9.263e-03  1.948e-03  -4.755 1.98e-06 ***
## ln_material_exp      3.349e-02  5.951e-03   5.627 1.83e-08 ***
## ln_personnel_exp    -2.359e-02  2.850e-03  -8.277  < 2e-16 ***
## profit_loss_year_st  9.156e-01  1.637e-01   5.594 2.22e-08 ***
## ln_subscribed_cap   -2.020e-02  3.990e-03  -5.062 4.14e-07 ***
## flag_error           6.079e-02  2.144e-01   0.283   0.7768    
## ceo_count2          -7.427e-04  2.246e-02  -0.033   0.9736    
## ceo_count3          -2.149e-01  4.829e-02  -4.450 8.61e-06 ***
## ceo_count_flag      -6.813e-02  2.096e-01  -0.325   0.7452    
## foreign             -8.260e-01  4.876e-01  -1.694   0.0903 .  
## female              -3.120e-01  3.926e-01  -0.795   0.4268    
## inoffice_days       -2.037e-04  4.552e-06 -44.749  < 2e-16 ***
## genderfemale         2.466e-01  2.026e-01   1.217   0.2235    
## gendermale           3.447e-02  1.946e-01   0.177   0.8594    
## gendermix                   NA         NA      NA       NA    
## originDomestic      -3.672e-01  2.422e-01  -1.516   0.1296    
## originForeign        4.192e-01  2.541e-01   1.650   0.0990 .  
## originmix                   NA         NA      NA       NA    
## urban_mother big    -6.446e-02  2.004e-02  -3.217   0.0013 ** 
## urban_mother        -1.572e-02  1.847e-02  -0.851   0.3949    
## region_mEast        -3.921e-02  1.851e-02  -2.119   0.0341 *  
## region_mWest        -8.996e-02  2.137e-02  -4.209 2.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132683  on 102398  degrees of freedom
## Residual deviance: 129466  on 102366  degrees of freedom
##   (212 observations deleted due to missingness)
## AIC: 129532
## 
## Number of Fisher Scoring iterations: 4
ols_modelx4 <- lm(formula(paste0("fast_growth ~", paste0(X4, collapse = " + "))),
                data = data_filtered)
summary(ols_modelx4)
## 
## Call:
## lm(formula = formula(paste0("fast_growth ~", paste0(X4, collapse = " + "))), 
##     data = data_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7937 -0.3563 -0.2719  0.5911  0.9990 
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.277e-01  6.957e-02   7.585 3.35e-14 ***
## ln_amort             1.949e-03  7.710e-04   2.528 0.011466 *  
## amort_flag          -1.241e-01  8.895e-02  -1.395 0.162892    
## ln_curr_assets      -2.446e-03  1.450e-03  -1.687 0.091586 .  
## ln_fixed_assets     -6.977e-04  1.867e-03  -0.374 0.708662    
## ln_intang_assets    -1.402e-03  6.109e-04  -2.294 0.021771 *  
## ln_liq_assets        3.945e-03  9.755e-04   4.044 5.26e-05 ***
## ln_tang_assets      -3.505e-03  1.833e-03  -1.912 0.055884 .  
## flag_asset_problem   8.651e-03  4.743e-02   0.182 0.855293    
## ln_curr_liab         4.762e-03  8.217e-04   5.795 6.85e-09 ***
## ln_extra_exp        -4.053e-03  8.132e-04  -4.984 6.24e-07 ***
## ln_extra_inc        -3.288e-03  6.370e-04  -5.163 2.44e-07 ***
## inc_bef_tax_st      -4.212e-02  2.986e-02  -1.410 0.158410    
## ln_inventories      -1.503e-03  4.421e-04  -3.398 0.000678 ***
## ln_material_exp      9.046e-03  1.312e-03   6.895 5.41e-12 ***
## ln_personnel_exp    -4.422e-03  6.372e-04  -6.939 3.97e-12 ***
## profit_loss_year_st  2.017e-01  3.533e-02   5.708 1.14e-08 ***
## ln_subscribed_cap   -5.034e-03  8.738e-04  -5.760 8.42e-09 ***
## flag_error           1.117e-02  4.531e-02   0.247 0.805262    
## ceo_count2           3.018e-03  4.922e-03   0.613 0.539777    
## ceo_count3          -4.069e-02  1.020e-02  -3.991 6.57e-05 ***
## ceo_count_flag      -1.368e-02  4.389e-02  -0.312 0.755347    
## foreign             -1.751e-01  1.013e-01  -1.728 0.083905 .  
## female              -5.775e-02  8.073e-02  -0.715 0.474429    
## inoffice_days       -3.374e-05  9.429e-07 -35.781  < 2e-16 ***
## genderfemale         4.761e-02  4.175e-02   1.140 0.254094    
## gendermale           1.282e-02  4.003e-02   0.320 0.748862    
## gendermix                   NA         NA      NA       NA    
## originDomestic      -8.153e-02  5.050e-02  -1.614 0.106425    
## originForeign        8.254e-02  5.278e-02   1.564 0.117816    
## originmix                   NA         NA      NA       NA    
## urban_mother big    -1.478e-02  4.412e-03  -3.350 0.000808 ***
## urban_mother        -3.548e-03  4.085e-03  -0.868 0.385147    
## region_mEast        -8.864e-03  4.055e-03  -2.186 0.028806 *  
## region_mWest        -1.907e-02  4.638e-03  -4.112 3.93e-05 ***
## year2011             6.759e-03  5.475e-03   1.235 0.216982    
## year2012             4.183e-02  5.357e-03   7.809 5.82e-15 ***
## year2013             4.402e-02  5.280e-03   8.337  < 2e-16 ***
## year2014             4.686e-02  5.328e-03   8.794  < 2e-16 ***
## year2015             5.511e-02  5.431e-03  10.147  < 2e-16 ***
## ind2_cat26          -7.682e-02  2.967e-02  -2.589 0.009617 ** 
## ind2_cat27          -6.822e-02  3.006e-02  -2.269 0.023242 *  
## ind2_cat28          -8.422e-02  2.938e-02  -2.867 0.004143 ** 
## ind2_cat29          -3.986e-03  3.124e-02  -0.128 0.898447    
## ind2_cat30           7.559e-03  3.291e-02   0.230 0.818332    
## ind2_cat32          -8.025e-02  3.370e-02  -2.381 0.017244 *  
## ind2_cat33          -5.575e-02  2.940e-02  -1.896 0.057946 .  
## ind2_cat40           7.844e-03  3.241e-02   0.242 0.808776    
## ind2_cat55          -5.758e-02  2.938e-02  -1.960 0.050035 .  
## ind2_cat56          -6.403e-02  2.915e-02  -2.197 0.028025 *  
## ind2_cat60          -1.684e-05  3.215e-02  -0.001 0.999582    
## ind2_cat99          -2.332e-01  8.521e-02  -2.737 0.006203 ** 
## imputed1             2.115e-01  5.666e-03  37.323  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4658 on 102348 degrees of freedom
##   (212 observations deleted due to missingness)
## Multiple R-squared:  0.04744,    Adjusted R-squared:  0.04698 
## F-statistic:   102 on 50 and 102348 DF,  p-value: < 2.2e-16
glm_modelx4 <- glm(formula(paste0("fast_growth_f ~", paste0(X4, collapse = " + "))),
                 data = data_filtered, family = "binomial")
summary(glm_modelx4)
## 
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X4, collapse = " + "))), 
##     family = "binomial", data = data_filtered)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7545  -0.9343  -0.7891   1.3323   2.1940  
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          2.433e-01  3.308e-01   0.735 0.462167    
## ln_amort             9.184e-03  3.532e-03   2.600 0.009326 ** 
## amort_flag          -7.111e-01  4.869e-01  -1.460 0.144184    
## ln_curr_assets      -1.142e-02  6.655e-03  -1.717 0.086019 .  
## ln_fixed_assets     -2.948e-03  8.501e-03  -0.347 0.728768    
## ln_intang_assets    -6.389e-03  2.844e-03  -2.246 0.024700 *  
## ln_liq_assets        1.829e-02  4.505e-03   4.059 4.92e-05 ***
## ln_tang_assets      -1.578e-02  8.350e-03  -1.890 0.058783 .  
## flag_asset_problem   4.066e-02  2.175e-01   0.187 0.851725    
## ln_curr_liab         2.209e-02  3.828e-03   5.769 7.95e-09 ***
## ln_extra_exp        -1.925e-02  3.845e-03  -5.007 5.53e-07 ***
## ln_extra_inc        -1.547e-02  2.999e-03  -5.158 2.49e-07 ***
## inc_bef_tax_st      -1.847e-01  1.396e-01  -1.323 0.185791    
## ln_inventories      -6.635e-03  2.025e-03  -3.276 0.001053 ** 
## ln_material_exp      3.989e-02  6.050e-03   6.593 4.31e-11 ***
## ln_personnel_exp    -1.996e-02  2.892e-03  -6.900 5.18e-12 ***
## profit_loss_year_st  9.307e-01  1.647e-01   5.651 1.59e-08 ***
## ln_subscribed_cap   -2.385e-02  4.051e-03  -5.886 3.97e-09 ***
## flag_error           2.795e-02  2.166e-01   0.129 0.897317    
## ceo_count2           1.770e-02  2.265e-02   0.782 0.434453    
## ceo_count3          -1.916e-01  4.866e-02  -3.937 8.24e-05 ***
## ceo_count_flag      -6.923e-02  2.103e-01  -0.329 0.742060    
## foreign             -8.605e-01  4.885e-01  -1.761 0.078182 .  
## female              -2.834e-01  3.955e-01  -0.717 0.473657    
## inoffice_days       -1.651e-04  4.667e-06 -35.383  < 2e-16 ***
## genderfemale         2.309e-01  2.041e-01   1.131 0.257970    
## gendermale           5.366e-02  1.960e-01   0.274 0.784309    
## gendermix                   NA         NA      NA       NA    
## originDomestic      -3.992e-01  2.428e-01  -1.644 0.100118    
## originForeign        4.078e-01  2.546e-01   1.601 0.109274    
## originmix                   NA         NA      NA       NA    
## urban_mother big    -6.720e-02  2.031e-02  -3.309 0.000937 ***
## urban_mother        -1.476e-02  1.873e-02  -0.788 0.430656    
## region_mEast        -4.022e-02  1.872e-02  -2.149 0.031647 *  
## region_mWest        -8.890e-02  2.159e-02  -4.117 3.84e-05 ***
## year2011             3.384e-02  2.614e-02   1.295 0.195482    
## year2012             1.982e-01  2.521e-02   7.862 3.77e-15 ***
## year2013             2.064e-01  2.479e-02   8.325  < 2e-16 ***
## year2014             2.177e-01  2.501e-02   8.704  < 2e-16 ***
## year2015             2.545e-01  2.541e-02  10.016  < 2e-16 ***
## ind2_cat26          -3.479e-01  1.334e-01  -2.607 0.009128 ** 
## ind2_cat27          -3.076e-01  1.353e-01  -2.273 0.023019 *  
## ind2_cat28          -3.827e-01  1.320e-01  -2.899 0.003747 ** 
## ind2_cat29          -1.927e-02  1.403e-01  -0.137 0.890732    
## ind2_cat30           3.186e-02  1.475e-01   0.216 0.829017    
## ind2_cat32          -3.659e-01  1.542e-01  -2.373 0.017657 *  
## ind2_cat33          -2.508e-01  1.320e-01  -1.899 0.057540 .  
## ind2_cat40           3.048e-02  1.453e-01   0.210 0.833837    
## ind2_cat55          -2.606e-01  1.320e-01  -1.974 0.048390 *  
## ind2_cat56          -2.892e-01  1.309e-01  -2.210 0.027099 *  
## ind2_cat60          -3.308e-03  1.442e-01  -0.023 0.981701    
## ind2_cat99          -1.195e+00  4.702e-01  -2.542 0.011025 *  
## imputed1             8.633e-01  2.513e-02  34.357  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132683  on 102398  degrees of freedom
## Residual deviance: 127800  on 102348  degrees of freedom
##   (212 observations deleted due to missingness)
## AIC: 127902
## 
## Number of Fisher Scoring iterations: 4
# calculate average marginal effects (dy/dx) for logit
# vce = "none" makes it run much faster; here we do not need variances

m <- margins(glm_modelx4, vce = "none")

sum_table2 <- summary(glm_modelx4) %>%
  coef() %>%
  as.data.frame() %>%
  select(Estimate, `Std. Error`) %>%
  mutate(factor = row.names(.)) %>%
  merge(summary(m)[,c("factor","AME")])

kable(x = sum_table2, format = "html", digits = 3,
      col.names = c("Variable", "Coefficient", "SE", "dy/dx"),
      caption = "Average Marginal Effects (dy/dx) for Logit Model")
Average Marginal Effects (dy/dx) for Logit Model
Variable Coefficient SE dy/dx
amort_flag -0.711 0.487 -0.154
ceo_count_flag -0.069 0.210 -0.015
ceo_count2 0.018 0.023 0.004
ceo_count3 -0.192 0.049 -0.040
female -0.283 0.396 -0.061
flag_asset_problem 0.041 0.218 0.009
flag_error 0.028 0.217 0.006
foreign -0.860 0.489 -0.187
genderfemale 0.231 0.204 0.051
gendermale 0.054 0.196 0.011
imputed1 0.863 0.025 0.203
inc_bef_tax_st -0.185 0.140 -0.040
ind2_cat26 -0.348 0.133 -0.078
ind2_cat27 -0.308 0.135 -0.069
ind2_cat28 -0.383 0.132 -0.085
ind2_cat29 -0.019 0.140 -0.004
ind2_cat30 0.032 0.148 0.007
ind2_cat32 -0.366 0.154 -0.081
ind2_cat33 -0.251 0.132 -0.057
ind2_cat40 0.030 0.145 0.007
ind2_cat55 -0.261 0.132 -0.059
ind2_cat56 -0.289 0.131 -0.065
ind2_cat60 -0.003 0.144 -0.001
ind2_cat99 -1.195 0.470 -0.230
inoffice_days 0.000 0.000 0.000
ln_amort 0.009 0.004 0.002
ln_curr_assets -0.011 0.007 -0.002
ln_curr_liab 0.022 0.004 0.005
ln_extra_exp -0.019 0.004 -0.004
ln_extra_inc -0.015 0.003 -0.003
ln_fixed_assets -0.003 0.009 -0.001
ln_intang_assets -0.006 0.003 -0.001
ln_inventories -0.007 0.002 -0.001
ln_liq_assets 0.018 0.005 0.004
ln_material_exp 0.040 0.006 0.009
ln_personnel_exp -0.020 0.003 -0.004
ln_subscribed_cap -0.024 0.004 -0.005
ln_tang_assets -0.016 0.008 -0.003
originDomestic -0.399 0.243 -0.088
originForeign 0.408 0.255 0.095
profit_loss_year_st 0.931 0.165 0.202
region_mEast -0.040 0.019 -0.009
region_mWest -0.089 0.022 -0.019
urban_mother -0.015 0.019 -0.003
urban_mother big -0.067 0.020 -0.015
year2011 0.034 0.026 0.007
year2012 0.198 0.025 0.042
year2013 0.206 0.025 0.044
year2014 0.218 0.025 0.047
year2015 0.254 0.025 0.055

Now I split the dataset into a training set (80%) and a holdout set (20%).

set.seed(1234)

data_filtered <- data_filtered %>%
  drop_na(flag_asset_problem, flag_error)

data_filtered <- data_filtered %>% 
  mutate(fast_growth_f = factor(fast_growth_f, 
                        labels = make.names(levels(fast_growth_f))))

train_indices <- as.integer(createDataPartition(data_filtered$fast_growth_f, p = 0.8, list = FALSE))
data_train <- data_filtered[train_indices, ]
data_holdout <- data_filtered[-train_indices, ]

dim(data_train)
## [1] 81920    87
dim(data_holdout)
## [1] 20479    87

Part I: Probability Prediction

# 5-fold cross-validation

train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummaryExtended,
  savePredictions = TRUE)

logit_model_vars <- list("X1" = X1, "X2" = X2, "X3" = X3, "X4" = X4, "X5" = X5, "X6" = X6)

CV_RMSE_folds <- list()
logit_models <- list()


for (model_name in names(logit_model_vars)) {

  features <- logit_model_vars[[model_name]]

  set.seed(1234)
  glm_model <- train(
    formula(paste0("fast_growth_f ~", paste0(features, collapse = " + "))),
    method = "glm",
    data = data_train,
    family = binomial,
    trControl = train_control
  )

  logit_models[[model_name]] <- glm_model
  # Calculate RMSE on test for each fold
  CV_RMSE_folds[[model_name]] <- glm_model$resample[,c("Resample", "RMSE")]

}
# Logit lasso -----------------------------------------------------------

lambda <- 10^seq(-1, -4, length = 10)
grid <- expand.grid("alpha" = 1, lambda = lambda)

set.seed(1234)
system.time({
  logit_lasso_model <- train(
    formula(paste0("fast_growth_f ~", paste0(logitvars, collapse = " + "))),
    data = data_train,
    method = "glmnet",
    preProcess = c("center", "scale"),
    family = "binomial",
    trControl = train_control,
    tuneGrid = grid,
    na.action = na.exclude
  )
})
##    user  system elapsed 
## 115.098   3.702 120.353
tuned_logit_lasso_model <- logit_lasso_model$finalModel
best_lambda <- logit_lasso_model$bestTune$lambda
logit_models[["LASSO"]] <- logit_lasso_model
lasso_coeffs <- as.matrix(coef(tuned_logit_lasso_model, best_lambda))

CV_RMSE_folds[["LASSO"]] <- logit_lasso_model$resample[,c("Resample", "RMSE")]
# Draw ROC Curve and calculate AUC for each folds --------------------------------
CV_AUC_folds <- list()

for (model_name in names(logit_models)) {

  auc <- list()
  model <- logit_models[[model_name]]
  for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
    cv_fold <-
      model$pred %>%
      filter(Resample == fold)

    roc_obj <- roc(cv_fold$obs, cv_fold$X1)
    auc[[fold]] <- as.numeric(roc_obj$auc)
  }

  CV_AUC_folds[[model_name]] <- data.frame("Resample" = names(auc),
                                              "AUC" = unlist(auc))
}

# For each model: average RMSE and average AUC for models ----------------------------------

CV_RMSE <- list()
CV_AUC <- list()

for (model_name in names(logit_models)) {
  CV_RMSE[[model_name]] <- mean(CV_RMSE_folds[[model_name]]$RMSE)
  CV_AUC[[model_name]] <- mean(CV_AUC_folds[[model_name]]$AUC)
}

# 7 models (6 logit and the logit LASSO). For each we have 5-fold CV RMSE and AUC.
# We pick our preferred model based on these. -----------------------------------------------

nvars <- lapply(logit_models, FUN = function(x) length(x$coefnames))
nvars[["LASSO"]] <- sum(lasso_coeffs != 0)

logit_summary1 <- data.frame("Number of predictors" = unlist(nvars),
                             "CV RMSE" = unlist(CV_RMSE),
                             "CV AUC" = unlist(CV_AUC))

kable(x = logit_summary1, format = "html", booktabs = TRUE,  digits = 3, row.names = TRUE,
      linesep = "", col.names = c("Number of predictors","CV RMSE","CV AUC")) 
Number of predictors CV RMSE CV AUC
X1 18 0.475 0.554
X2 30 0.470 0.608
X3 34 0.470 0.609
X4 52 0.466 0.623
X5 67 0.464 0.634
X6 115 0.463 0.636
LASSO 102 0.463 0.616

Based on this information, I choose model X6. Although the difference between models X4 and X6 in CV RMSE is not large, X6's AUC is noticeably better. Interestingly, LASSO matches X6 on CV RMSE but performs considerably worse on CV AUC.
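The same choice can be checked programmatically. A small sketch using the values copied from the summary table above (ties in `which.min` are broken by the first occurrence):

```r
# CV RMSE and CV AUC copied from the model summary table above
cv_rmse <- c(X1 = 0.475, X2 = 0.470, X3 = 0.470, X4 = 0.466,
             X5 = 0.464, X6 = 0.463, LASSO = 0.463)
cv_auc  <- c(X1 = 0.554, X2 = 0.608, X3 = 0.609, X4 = 0.623,
             X5 = 0.634, X6 = 0.636, LASSO = 0.616)

names(which.min(cv_rmse))  # "X6" (tied with LASSO; first occurrence wins)
names(which.max(cv_auc))   # "X6"
```

X6 is best or tied-best on both criteria, which supports picking it as the benchmark model.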

best_logit_no_loss <- logit_models[["X6"]]

logit_predicted_probabilities_holdout <- predict(best_logit_no_loss, newdata = data_holdout, type = "prob")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
data_holdout[,"best_logit_no_loss_pred"] <- logit_predicted_probabilities_holdout[,"X1"]
RMSE(data_holdout[, "best_logit_no_loss_pred", drop=TRUE], data_holdout$fast_growth)
## [1] 0.4646898
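Since fast_growth is a 0/1 outcome, this holdout RMSE is simply the square root of the Brier score of the predicted probabilities. A minimal sketch of the calculation on toy values (not the report's data):

```r
# RMSE of probability predictions against a binary outcome;
# equivalently, the square root of the Brier score
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# toy values for illustration only
pred   <- c(0.2, 0.7, 0.4, 0.9)
actual <- c(0, 1, 0, 1)
rmse(pred, actual)  # sqrt(0.075), about 0.274
```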
# discrete ROC (with thresholds in steps)
thresholds <- seq(0.05, 0.75, by = 0.05)

cm <- list()
true_positive_rates <- c()
false_positive_rates <- c()
for (thr in thresholds) {
  holdout_prediction <- ifelse(data_holdout[,"best_logit_no_loss_pred"] < thr, "X0", "X1") %>%
    factor(levels = c("X0", "X1"))
  cm_thr <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)$table
  cm[[as.character(thr)]] <- cm_thr
  true_positive_rates <- c(true_positive_rates, cm_thr["X1", "X1"] /
                             (cm_thr["X1", "X1"] + cm_thr["X0", "X1"]))
  false_positive_rates <- c(false_positive_rates, cm_thr["X1", "X0"] /
                              (cm_thr["X1", "X0"] + cm_thr["X0", "X0"]))
}

tpr_fpr_for_thresholds <- tibble(
  "threshold" = thresholds,
  "true_positive_rate" = true_positive_rates,
  "false_positive_rate" = false_positive_rates
)

discrete_roc_plot <- ggplot(
  data = tpr_fpr_for_thresholds,
  aes(x = false_positive_rate, y = true_positive_rate, color = threshold)) +
  labs(x = "False positive rate (1 - Specificity)", y = "True positive rate (Sensitivity)") +
  geom_point(size=2, alpha=0.8) +
  scale_color_viridis(option = "D", direction = -1) +
  scale_x_continuous(expand = c(0.01,0.01), limit=c(0,1), breaks = seq(0,1,0.1)) +
  scale_y_continuous(expand = c(0.01,0.01), limit=c(0,1), breaks = seq(0,1,0.1)) +
  theme_bg() +
  theme(legend.position ="right") +
  theme(legend.title = element_text(size = 4), 
        legend.text = element_text(size = 4),
        legend.key.size = unit(.4, "cm")) 
discrete_roc_plot

# continuous ROC on holdout with best model (Logit X6) -------------------------------------------

roc_obj_holdout <- roc(data_holdout$fast_growth_f, data_holdout$best_logit_no_loss_pred)
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
createRocPlot(roc_obj_holdout, "best_logit_no_loss_roc_plot_holdout")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

# Confusion table with different thresholds ----------------------------------------------------------

# default: the threshold 0.5 is used to convert probabilities to binary classes
logit_class_prediction <- predict(best_logit_no_loss, newdata = data_holdout)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
summary(logit_class_prediction)
##    X0    X1 
## 18521  1958
# confusion matrix: summarize different type of errors and successfully predicted cases
# positive = "yes": explicitly specify the positive case
cm_object1 <- confusionMatrix(logit_class_prediction, data_holdout$fast_growth_f, positive = "X1")
cm1 <- cm_object1$table
cm1
##           Reference
## Prediction    X0    X1
##         X0 12524  5997
##         X1   773  1185
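From this confusion matrix one can read off how the default 0.5 threshold trades off the two error types. A small check, re-using the cell counts printed above:

```r
# confusion matrix at the 0.5 threshold (rows = prediction, cols = reference),
# with the counts copied from the cm1 output above
cm <- matrix(c(12524, 773, 5997, 1185), nrow = 2,
             dimnames = list(Prediction = c("X0", "X1"),
                             Reference  = c("X0", "X1")))

sensitivity <- cm["X1", "X1"] / sum(cm[, "X1"])  # TP / (TP + FN)
specificity <- cm["X0", "X0"] / sum(cm[, "X0"])  # TN / (TN + FP)
accuracy    <- sum(diag(cm)) / sum(cm)
round(c(sensitivity, specificity, accuracy), 3)  # 0.165 0.942 0.669
```

At 0.5 the model is very specific but catches only about 17% of the truly fast-growing firms, which is why alternative thresholds are worth exploring.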
# we can apply different thresholds

# 0.4 
holdout_prediction <-
  ifelse(data_holdout$best_logit_no_loss_pred < 0.4, "X0", "X1") %>%
  factor(levels = c("X0", "X1"))
cm_object1b <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm1b <- cm_object1b$table
cm1b
##           Reference
## Prediction    X0    X1
##         X0 10455  4419
##         X1  2842  2763
# a sensible choice of threshold: the mean of the predicted probabilities
mean_predicted_growth_prob <- mean(data_holdout$best_logit_no_loss_pred)
mean_predicted_growth_prob
## [1] 0.3507365
holdout_prediction <-
  ifelse(data_holdout$best_logit_no_loss_pred < mean_predicted_growth_prob, "X0", "X1") %>%
  factor(levels = c("X0", "X1"))
cm_object2 <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm2 <- cm_object2$table
cm2
##           Reference
## Prediction   X0   X1
##         X0 7965 2960
##         X1 5332 4222
create_calibration_plot(data_holdout,                    
  prob_var = "best_logit_no_loss_pred", 
  actual_var = "fast_growth",
  n_bins = 15)
## Warning: Removed 2 row(s) containing missing values (geom_path).

Classification

Let's frame the business problem as follows: I need to predict whether a firm is going to be fast-growing, because my client wants to invest only in fast-growing firms (for example, by buying their stock). My client is very risk averse: he does not want to lose money or earn low returns on these investments, since he has good alternative investment options. At the same time, he wants to grow his wealth, so failing to capture a successful company is also disappointing, though less costly. Bearing this in mind, I specify my loss function as follows:

False Positive (predicting that a firm is fast-growing when in fact it is not) = 9
False Negative (predicting that a firm is not fast-growing when in fact it is) = 3

FP <- 9
FN <- 3

cost <- FN/FP # relative cost of a false negative vs a false positive: 0.33

prevelance <- sum(data_train$fast_growth)/length(data_train$fast_growth) # share of fast-growing firms: 0.3506836
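With this loss function, theory also gives a formula-based benchmark for the classification threshold: predict fast growth when the predicted probability exceeds FP/(FP+FN). A quick sketch of the reasoning:

```r
# expected loss of predicting positive:  (1 - p) * FP
# expected loss of predicting negative:  p * FN
# so predict positive when (1 - p) * FP < p * FN, i.e. when p > FP / (FP + FN)
FP <- 9
FN <- 3
FP / (FP + FN)  # 0.75
```

The data-driven optimal thresholds found via cross-validation below should lie in this neighbourhood, at least for the better-calibrated models.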

In some cases, I received Inf as the optimal threshold. If I understood correctly, this happens when, in a given fold, the loss-minimizing rule classifies all observations as negatives (since my loss function punishes false positives quite severely), so no finite optimal threshold exists for that fold. For this and the following tasks, I replaced Inf with NA.

# Draw ROC Curve and find optimal threshold with loss function --------------------------

best_tresholds <- list()
expected_loss <- list()
logit_cv_rocs <- list()
logit_cv_threshold <- list()
logit_cv_expected_loss <- list()

for (model_name in names(logit_models)) {

  model <- logit_models[[model_name]]
  colname <- paste0(model_name,"_prediction")

  best_tresholds_cv <- list()
  expected_loss_cv <- list()

  for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
    cv_fold <-
      model$pred %>%
      filter(Resample == fold)

    roc_obj <- roc(cv_fold$obs, cv_fold$X1)
    best_treshold <- coords(roc_obj, "best", ret="all", transpose = FALSE,
                            best.method = "youden", best.weights = c(cost, prevelance))
    best_tresholds_cv[[fold]] <- best_treshold$threshold
    expected_loss_cv[[fold]] <- (best_treshold$fp*FP + best_treshold$fn*FN)/length(cv_fold$X1)
  }

  # average
  best_tresholds_cv <- unlist(best_tresholds_cv)
  best_tresholds_cv[is.infinite(best_tresholds_cv)] <- NA 
  best_tresholds[[model_name]] <- mean(best_tresholds_cv, na.rm = T)
  expected_loss[[model_name]] <- mean(unlist(expected_loss_cv))

  # for fold #5
  logit_cv_rocs[[model_name]] <- roc_obj
  logit_cv_threshold[[model_name]] <- best_treshold
  logit_cv_expected_loss[[model_name]] <- expected_loss_cv[[fold]]

  }

logit_summary2 <- data.frame("Avg of optimal thresholds" = unlist(best_tresholds),
                             "Threshold for Fold5" = sapply(logit_cv_threshold, function(x) {x$threshold}),
                             "Avg expected loss" = unlist(expected_loss),
                             "Expected loss for Fold5" = unlist(logit_cv_expected_loss))

kable(x = logit_summary2, format = "html", booktabs=TRUE,  digits = 3, row.names = TRUE,
      linesep = "", col.names = c("Avg of optimal thresholds","Threshold for fold #5",
                                  "Avg expected loss","Expected loss for fold #5"))
       Avg of optimal   Threshold     Avg expected   Expected loss
       thresholds       for fold #5   loss           for fold #5
X1     0.549            0.536         1.051          1.051
X2     0.598            Inf           1.052          1.052
X3     0.587            Inf           1.052          1.052
X4     0.706            0.696         1.051          1.051
X5     0.730            0.737         1.050          1.051
X6     0.721            0.663         1.049          1.048
LASSO  0.722            0.666         1.051          1.052

Based on the lowest average expected loss (1.049), I should choose model X6.

best_logit_with_loss <- logit_models[["X6"]]
best_logit_optimal_treshold <- best_tresholds[["X6"]]

logit_predicted_probabilities_holdout <- predict(best_logit_with_loss, newdata = data_holdout, type = "prob")
data_holdout[,"best_logit_with_loss_pred"] <- logit_predicted_probabilities_holdout[,"X1"]

# ROC curve on holdout
roc_obj_holdout <- roc(data_holdout$fast_growth, data_holdout[, "best_logit_with_loss_pred", drop=TRUE])

# Get expected loss on holdout
holdout_treshold <- coords(roc_obj_holdout, x = best_logit_optimal_treshold, input= "threshold",
                           ret="all", transpose = FALSE)
expected_loss_holdout <- (holdout_treshold$fp*FP + holdout_treshold$fn*FN)/length(data_holdout$fast_growth_f)
expected_loss_holdout
## [1] 1.052102
# Confusion table on holdout with optimal threshold
holdout_prediction <-
  ifelse(data_holdout$best_logit_with_loss_pred < best_logit_optimal_treshold, "X0", "X1") %>%
  factor(levels = c("X0", "X1"))
cm_object3 <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm3 <- cm_object3$table
cm3
##           Reference
## Prediction    X0    X1
##         X0 13276  7119
##         X1    21    63
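The holdout expected loss of 1.052 reported above can be verified by hand from this confusion table:

```r
# from cm3: FP count = 21, FN count = 7119, holdout size = 13276 + 7119 + 21 + 63 = 20479
(21 * 9 + 7119 * 3) / (13276 + 7119 + 21 + 63)  # ~1.0521
```

With such a high optimal threshold, the model makes almost no false-positive errors, accepting many false negatives instead, exactly as the asymmetric loss function dictates.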

Classification with Random Forest

data_for_graph <- data_train
levels(data_for_graph$fast_growth_f) <- list("slow" = "X0", "fast" = "X1")

set.seed(1234)
rf_for_graph <-
  rpart(
    formula = formula(paste0("fast_growth_f ~ ", paste0(rfvars, collapse = " + "))),
    data = data_for_graph,
    control = rpart.control(cp = 0.0028, minbucket = 100)
  )

rpart.plot(rf_for_graph, tweak=1, digits=2, extra=107, under = TRUE)

# 5 fold cross-validation

train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE, # same as probability = TRUE in ranger
  summaryFunction = twoClassSummaryExtended,
  savePredictions = TRUE
)
train_control$verboseIter <- TRUE

tune_grid <- expand.grid(
  .mtry = c(5, 6, 7),
  .splitrule = "gini",
  .min.node.size = c(10, 15)
)
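The mtry grid is centred near the usual default for classification forests (ranger uses the floor of the square root of the number of predictors). With the 30 predictors in rfvars, that default is:

```r
# default mtry for classification: floor(sqrt(number of predictors))
floor(sqrt(30))  # 5, so the grid c(5, 6, 7) searches around and above the default
```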

# getModelInfo("ranger")
set.seed(1234)
rf_model_p <- train(
  formula(paste0("fast_growth_f ~ ", paste0(rfvars, collapse = " + "))),
  method = "ranger",
  data = data_train,
  tuneGrid = tune_grid,
  trControl = train_control
)
## + Fold1: mtry=5, splitrule=gini, min.node.size=10 
## - Fold1: mtry=5, splitrule=gini, min.node.size=10 
## + Fold1: mtry=6, splitrule=gini, min.node.size=10 
## - Fold1: mtry=6, splitrule=gini, min.node.size=10 
## + Fold1: mtry=7, splitrule=gini, min.node.size=10 
## - Fold1: mtry=7, splitrule=gini, min.node.size=10 
## + Fold1: mtry=5, splitrule=gini, min.node.size=15 
## - Fold1: mtry=5, splitrule=gini, min.node.size=15 
## + Fold1: mtry=6, splitrule=gini, min.node.size=15 
## - Fold1: mtry=6, splitrule=gini, min.node.size=15 
## + Fold1: mtry=7, splitrule=gini, min.node.size=15 
## - Fold1: mtry=7, splitrule=gini, min.node.size=15 
## ... (analogous output for Fold2 through Fold5) ...
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = gini, min.node.size = 10 on full training set
rf_model_p$results
##   mtry splitrule min.node.size  Accuracy     Kappa      RMSE  AccuracySD
## 1    5      gini            10 0.7209961 0.2850996 0.4248853 0.002461767
## 2    5      gini            15 0.7157715 0.2691237 0.4274002 0.002287683
## 3    6      gini            10 0.7257447 0.3018014 0.4217344 0.002788773
## 4    6      gini            15 0.7208863 0.2868017 0.4244283 0.002329176
## 5    7      gini            10 0.7299683 0.3157844 0.4194954 0.002887538
## 6    7      gini            15 0.7248047 0.2999997 0.4227247 0.002527797
##       KappaSD       RMSESD
## 1 0.006953160 0.0009671948
## 2 0.005814953 0.0007873598
## 3 0.007371201 0.0010379908
## 4 0.005885877 0.0008587080
## 5 0.007161393 0.0005065771
## 6 0.006734663 0.0008682306
best_mtry <- rf_model_p$bestTune$mtry
best_min_node_size <- rf_model_p$bestTune$min.node.size

# Get average (ie over the folds) RMSE and AUC ------------------------------------
CV_RMSE_folds[["rf_p"]] <- rf_model_p$resample[,c("Resample", "RMSE")]

auc <- list()
for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
  cv_fold <-
    rf_model_p$pred %>%
    filter(Resample == fold)

  roc_obj <- roc(cv_fold$obs, cv_fold$X1)
  auc[[fold]] <- as.numeric(roc_obj$auc)
}
CV_AUC_folds[["rf_p"]] <- data.frame("Resample" = names(auc),
                                         "AUC" = unlist(auc))

CV_RMSE[["rf_p"]] <- mean(CV_RMSE_folds[["rf_p"]]$RMSE)
CV_AUC[["rf_p"]] <- mean(CV_AUC_folds[["rf_p"]]$AUC)

# Now use loss function and search for best thresholds and expected loss over folds -----
best_tresholds_cv <- list()
expected_loss_cv <- list()

for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
  cv_fold <-
    rf_model_p$pred %>%
    filter(mtry == best_mtry,
           min.node.size == best_min_node_size,
           Resample == fold)

  roc_obj <- roc(cv_fold$obs, cv_fold$X1)
  best_treshold <- coords(roc_obj, "best", ret="all", transpose = FALSE,
                          best.method="youden", best.weights=c(cost, prevelance))
  best_tresholds_cv[[fold]] <- best_treshold$threshold
  expected_loss_cv[[fold]] <- (best_treshold$fp*FP + best_treshold$fn*FN)/length(cv_fold$X1)
}

# average
best_tresholds[["rf_p"]] <- mean(unlist(best_tresholds_cv))
expected_loss[["rf_p"]] <- mean(unlist(expected_loss_cv))


rf_summary <- data.frame("CV RMSE" = CV_RMSE[["rf_p"]],
                         "CV AUC" = CV_AUC[["rf_p"]],
                         "Avg of optimal thresholds" = best_tresholds[["rf_p"]],
                         "Threshold for Fold5" = best_treshold$threshold,
                         "Avg expected loss" = expected_loss[["rf_p"]],
                         "Expected loss for Fold5" = expected_loss_cv[[fold]])

kable(x = rf_summary, format = "html", booktabs=TRUE,  digits = 3, row.names = TRUE,
      linesep = "", col.names = c("CV RMSE", "CV AUC",
                                  "Avg of optimal thresholds","Threshold for fold #5",
                                  "Avg expected loss","Expected loss for fold #5"))
   CV RMSE   CV AUC   Avg of optimal   Threshold     Avg expected   Expected loss
                      thresholds       for fold #5   loss           for fold #5
1  0.419     0.818    0.55             0.545         0.985          0.971
# Create plots - this is for Fold5

createLossPlot(roc_obj, best_treshold, "rf_p_loss_plot")

createRocPlotWithOptimal(roc_obj, best_treshold, "rf_p_roc_plot")

# Take model to holdout and estimate RMSE, AUC and expected loss ------------------------------------

rf_predicted_probabilities_holdout <- predict(rf_model_p, newdata = data_holdout, type = "prob")
data_holdout$rf_p_prediction <- rf_predicted_probabilities_holdout[,"X1"]
RMSE(data_holdout$rf_p_prediction, data_holdout$fast_growth)
## [1] 0.4130405
# ROC curve on holdout
roc_obj_holdout <- roc(data_holdout$fast_growth, data_holdout[, "rf_p_prediction", drop=TRUE])

# AUC
as.numeric(roc_obj_holdout$auc)
## [1] 0.8495307
# Get expected loss on holdout with optimal threshold
holdout_treshold <- coords(roc_obj_holdout, x = best_tresholds[["rf_p"]] , input= "threshold",
                           ret="all", transpose = FALSE)
expected_loss_holdout <- (holdout_treshold$fp*FP + holdout_treshold$fn*FN)/length(data_holdout$fast_growth)
expected_loss_holdout
## [1] 0.976952
# Confusion table on holdout with optimal threshold
holdout_prediction <-
  ifelse(data_holdout$rf_p_prediction < best_tresholds[["rf_p"]], "X0", "X1") %>%
  factor(levels = c("X0", "X1"))
cm_object4 <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm4 <- cm_object4$table
cm4
##           Reference
## Prediction    X0    X1
##         X0 12920  5538
##         X1   377  1644
# Summary results ---------------------------------------------------

nvars[["rf_p"]] <- length(rfvars)

summary_results <- data.frame("Number of predictors" = unlist(nvars),
                              "CV RMSE" = unlist(CV_RMSE),
                              "CV AUC" = unlist(CV_AUC),
                              "CV threshold" = unlist(best_tresholds),
                              "CV expected Loss" = unlist(expected_loss))

model_names <- c("Logit X1", "Logit X6",
                 "Logit LASSO","RF probability")
summary_results <- summary_results %>%
  filter(rownames(.) %in% c("X1", "X6", "LASSO", "rf_p"))
rownames(summary_results) <- model_names

kable(x = summary_results, format = "html", booktabs=TRUE,  digits = 3, row.names = TRUE,
      linesep = "", col.names = c("Number of predictors", "CV RMSE", "CV AUC",
                                  "CV threshold", "CV expected Loss"))
                 Number of predictors   CV RMSE   CV AUC   CV threshold   CV expected loss
Logit X1         18                     0.475     0.554    0.549          1.051
Logit X6         115                    0.463     0.636    0.721          1.049
Logit LASSO      102                    0.463     0.616    0.722          1.051
RF probability   30                     0.419     0.818    0.550          0.985